NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)
You should have R installed –if not:
Download workshop materials:
R is a programming language designed for statistical computing. Notable characteristics include:
OK, it’s free and popular, but what makes R worth learning? In a word, “packages”. If you have a data manipulation, analysis, or visualization task, chances are good that there is an R package for that. Let’s install some packages and look at some examples.
library(ggmap)
nwbuilding <- geocode("1737 Cambridge Street Cambridge, MA 02138", source = "google")
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=1737%20Cambridge%20Street%20Cambridge,%20MA%2002138&sensor=false
ggmap(get_map("Cambridge, MA", zoom = 15)) +
  geom_point(data=nwbuilding, size = 7, shape = 13, color = "red")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Cambridge,+MA&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Cambridge,%20MA&sensor=false
library(forecast)
library(plotly)
## from https://esa.un.org/unpd/wpp/Download/Standard/Population/
worldpop <- structure(c(2.525149312, 2.571867515, 2.617940399, 2.66402901,
2.710677773, 2.758314525, 2.807246148, 2.85766291, 2.909651396,
2.963216053, 3.018343828, 3.075073173, 3.133554362, 3.194075347,
3.256988501, 3.322495121, 3.390685523, 3.461343172, 3.533966901,
3.607865513, 3.682487691, 3.757734668, 3.833594894, 3.90972212,
3.985733775, 4.061399228, 4.13654207, 4.211322427, 4.286282447,
4.362189531, 4.439632465, 4.518602042, 4.599003374, 4.681210508,
4.765657562, 4.852540569, 4.942056118, 5.033804944, 5.126632694,
5.218978019, 5.309667699, 5.398328753, 5.485115276, 5.57004538,
5.653315893, 5.735123084, 5.815392305, 5.894155105, 5.971882825,
6.049205203, 6.126622121, 6.204310739, 6.282301767, 6.360764684,
6.439842408, 6.51963585, 6.600220247, 6.68160732, 6.763732879,
6.846479521, 6.92972504300001, 7.013427052, 7.097500453, 7.181715139,
7.265785946, 7.349472099), .Tsp = c(1950, 2015, 1), class = "ts")
## Projected numbers (in billions) of humans living on earth
fit <- auto.arima(worldpop)
ggplotly(autoplot(forecast(fit)))
comet <- rgl::readOBJ(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
plot_ly(x = comet$vb[1,],
        y = comet$vb[2,],
        z = comet$vb[3,],
        i = comet$it[1,]-1,
        j = comet$it[2,]-1,
        k = comet$it[3,]-1,
        type = "mesh3d")
Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.
Coming from…
The old-school way is to run R directly in a terminal
But hardly anybody does it that way anymore! The Windows version of R comes with a GUI that looks like this:
The default Windows GUI is not very good.
RStudio (an alternative GUI for R) is shown below.
RStudio has many useful features, including parentheses matching and auto-completion. RStudio is not the only advanced R interface; other alternatives include Emacs with ESS (shown below).
Emacs + ESS is a very powerful combination, but can be difficult to set up.
Jupyter is a notebook interface that runs in your web browser. A lot of people like it. You can access these workshop notes as a Jupyter notebook at http://tutorials-live.iq.harvard.edu:8000/notebooks/workshops/R/Rintro/Rintro.ipynb
Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).
Rintro.R script in the Rintro folder on your desktop
# is a comment that will be ignored by R. My comments all start with ##; you can add your own, possibly using # or ### to distinguish your comments from mine.
Now that we know what we’re getting into and have our environment set up, let’s get to work.
The purpose of this exercise is mostly to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to learn. If you don’t know how to do something, you can use internet search engines, search on StackOverflow, or ask the person next to you.
Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!
car. Try to install this package.
2 + 2
## [1] 4
sum(2, 2)
## [1] 4
sqrt(10)
## [1] 3.162278
10^(1/2)
## [1] 3.162278
In RStudio, go to the “Packages” tab and click the “Install” button. Search in the pop-up window and click “Install”.
Alternatively, use the install.packages function like this:
install.packages("car")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
Go to the main help page by running help.start() or using the GUI menu, then find and click on the link to “An Introduction to R”.
I like the machine learning topic.
I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.
The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is in dataSets/babyNames.csv.
Our first goal is to read these data into R. In order to do that we need to learn how to call functions, install packages, set our working directory, read a .csv file, and assign the result to a name. Let’s get to it.
There are thousands of R packages that extend R’s capabilities. Some packages are distributed with R, and some of these are attached to the search path by default. Many more are available in package repositories.
In order to make reading and analyzing our baby names data easier we will install and use a collection of packages called tidyverse. tidyverse is a meta package that loads the dplyr package for easier data manipulation, the readr package for easier data import/export, and several other useful packages.
Packages can be installed using the install.packages function.
The general form for calling R functions is
## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)
Arguments can be matched by position or name. Let’s see how that works, using the install.packages function.
Since this is the first time we are using the install.packages function we will start by looking up its help page. This is almost always the first thing you should do when using a function for the first time. You can look up the help page for a function like this:
?install.packages
As we can see from the documentation, the first (and only required) argument is named pkgs. Additional arguments specify where this package should be installed from (repos) and to (lib), among other things.
OK, let’s install the “car” package from the repo at “https://cran.rstudio.com”.
install.packages("car", repos = "https://cran.rstudio.com")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
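Matching arguments by position or by name can be tried without installing anything. The sketch below uses base R's seq function, whose first three arguments are from, to, and by; the variable names are chosen only for illustration.

```r
# Three equivalent calls to seq(), whose first arguments are from, to, by
by_position <- seq(1, 10, 3)                   # matched by position
by_name     <- seq(from = 1, to = 10, by = 3)  # matched by name
reordered   <- seq(by = 3, to = 10, from = 1)  # named arguments may come in any order

by_position                      # the sequence 1, 4, 7, 10
identical(by_position, by_name)  # both calls give the same result
identical(by_name, reordered)
```

Naming arguments takes more typing, but it makes a call easier to read and protects against mixing up the order.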
Installing a package puts a copy of the package on your local computer, but does not make it available for use. To use an installed package you must attach it using the library function.
library("car")
Now that we’ve installed the car package, how do we use it? We’ve already seen that we can look up the help page using ?. This is actually a shortcut to the help function:
help(help)
The help function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the car package by reading its documentation like this:
help(package = "car")
The purpose of this exercise is to practice using the package management and help facilities.
Install the tidyverse package.
Use the library function to attach the tidyverse package.
.csv) file?
## 1. install the tidyverse package
install.packages("tidyverse")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
## 2. attach the tidyverse package
library("tidyverse")
## 3. look up the readr package documentation
help(package = "readr")
## I would use read_tsv to read a tab delimited file.
Now that we have installed and attached the tidyverse (and readr) packages, and know which function to use to read our data (read_csv), we are almost ready to read in the baby names data. Before we do that, let’s take a small excursion to learn about assignment and basic data types in R.
Values can be assigned names and used in subsequent operations.
The <- operator (less than followed by a dash) is used to save values.
x <- 10 # Assign the value 10 to a variable named x
x + 1 # Add 1 to x
## [1] 11
x # note that x is unchanged
## [1] 10
y <- x + 1 # Assign y the value x + 1
y
## [1] 11
x <- x + 100 # change the value of x
y ## note that y is unchanged.
## [1] 11
The x and y data objects we created are numeric vectors of length one. Vectors are the simplest data structure in R, and are the building blocks used to make more complex data structures. Here are some more vector examples.
x <- c(10, 11, 12)
X <- c("10", "11", "12")
y <- c("h", "e", "l", "l", "o")
Y <- "hello"
z <- c(TRUE, FALSE, TRUE, TRUE)
Notice that the c function combines its arguments into a vector.
All R objects have a type (reported by typeof, and closely related to the mode reported by mode) and a length. Since it is impossible for an object not to have these attributes, they are called intrinsic attributes.
print(x)
## [1] 10 11 12
typeof(x)
## [1] "double"
length(x)
## [1] 3
print(X)
## [1] "10" "11" "12"
typeof(X)
## [1] "character"
length(X)
## [1] 3
print(y)
## [1] "h" "e" "l" "l" "o"
length(y)
## [1] 5
print(Y)
## [1] "hello"
length(Y)
## [1] 1
print(z)
## [1] TRUE FALSE TRUE TRUE
typeof(z)
## [1] "logical"
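The difference between type and mode, and the way c settles on a single type for all of its arguments, can be seen in a short sketch. Only base R functions are used; the variable names are made up for this illustration.

```r
int_vec <- 1:3              # the colon operator creates integers
dbl_vec <- c(1, 2, 3)       # plain numbers are stored as doubles
mixed   <- c(1, "a", TRUE)  # c() converts its arguments to one common type

typeof(int_vec)  # "integer"
typeof(dbl_vec)  # "double"
mode(int_vec)    # "numeric"
mode(dbl_vec)    # "numeric": mode groups integer and double together
typeof(mixed)    # "character"
mixed            # "1" "a" "TRUE"
```

Because a vector holds values of only one type, mixing numbers and text in c silently turns the numbers into text.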
Data structures in R can be converted from one type to another using one of the many functions beginning with “as.”. For example:
print(x)
## [1] 10 11 12
mode(x)
## [1] "numeric"
mode(as.character(x))
## [1] "character"
print(X)
## [1] "10" "11" "12"
mode(X)
## [1] "character"
mode(as.numeric(X))
## [1] "numeric"
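Conversions do not always succeed, and it helps to know what happens when they cannot. The sketch below uses only base R conversion functions; the inputs and variable names are invented for illustration.

```r
truncated <- as.integer(3.9)                         # drops the fractional part
flags     <- as.logical(c("TRUE", "false", "yes"))   # unrecognized text becomes NA
# "eleven" cannot be read as a number, so it becomes NA (R also issues a warning,
# which is silenced here to keep the example output clean)
numbers   <- suppressWarnings(as.numeric(c("10", "eleven")))

truncated  # 3
flags      # TRUE FALSE NA
numbers    # 10 NA
```

Checking for NA values with is.na after a conversion is a simple way to catch entries that could not be converted.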
Now that we know how to do assignment using <- and how to understand basic data types in R we are finally ready to read in the baby names data.
R knows the directory it was started in, and refers to this as the “working directory”. Since our workshop examples are in the Rintro folder, we should all take a moment to set that as our working directory.
getwd() # what is my current working directory?
# setwd("~/Desktop/Rintro") # change directory
Note that “~” means “my home directory”, but that this can mean different things on different operating systems. You can also use the Files tab in RStudio to navigate to a directory, then click “More -> Set as working directory”.
We can also set the working directory using paths relative to the current working directory. Once we are in the “Rintro” folder we can navigate to the “dataSets” folder like this:
getwd() # get the current working directory
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"
setwd("dataSets") # set wd to the dataSets folder
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro/dataSets"
setwd("..") # set wd to enclosing folder ("up")
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R"
It can be convenient to list files in a directory without leaving R.
list.files("dataSets") # list files in the dataSets folder
## [1] "babyNames.csv"
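If you want to experiment with getwd, setwd, and list.files without touching the workshop folder, one option is to practice in a temporary directory. This is only a sketch: the folder name wd_demo and the file example.txt are made up, and everything is created under tempdir() so nothing on your desktop changes.

```r
old_wd <- getwd()                                 # remember where we started

demo_dir <- file.path(tempdir(), "wd_demo")       # hypothetical practice folder
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "example.txt"))   # an empty file to list

setwd(demo_dir)              # move into the practice folder
files_here <- list.files()   # with no argument, lists the working directory
files_here

setwd(old_wd)                # go back to where we started
```

Saving the original working directory first makes it easy to return to it when you are done.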
In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.
| data type | function | package |
|---|---|---|
| comma separated (.csv) | read_csv() | readr (tidyverse) |
| other delimited formats | read_delim() | readr (tidyverse) |
| R (.Rds) | read_rds() | readr (tidyverse) |
| Stata (.dta) | read_stata() | haven (tidyverse, needs to be attached separately) |
| SPSS (.sav) | read_spss() | haven (tidyverse, needs to be attached separately) |
| SAS (.sas7bdat) | read_sas() | haven (tidyverse, needs to be attached separately) |
| Excel (.xls, .xlsx) | read_excel() | readxl (tidyverse, needs to be attached separately) |
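As a small sketch of how one of the functions in the table is called, the example below writes a tiny semicolon-delimited file to a temporary location and reads it back with read_delim. The file contents and the names demo_file and demo_data are invented for this illustration, and the readr package is assumed to be installed.

```r
library(readr)

# Create a small semicolon-delimited file in a temporary location
demo_file <- tempfile(fileext = ".txt")
writeLines(c("name;count",
             "sophie;7087",
             "chloe;6824"),
           demo_file)

# delim tells read_delim which character separates the columns
demo_data <- read_delim(demo_file, delim = ";")
demo_data
```

The other readr functions in the table follow the same pattern: the first argument is the path to the file, and additional arguments control how it is parsed.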
The purpose of this exercise is to practice reading data into R. The data in “dataSets/babyNames.csv” is moderately tricky to read, making it a good data set to practice on.
read_csv function. How can you limit the number of rows to be read in?
dataSets/babyNames.csv”. Notice that the “Sex” column has been read as a logical (TRUE/FALSE).
read_csv help page to figure out how to make it read the “Sex” column as a character. Make adjustments to your code until you have read in the first 10 rows with the correct column types. “Year” and “Name.length” should be integer (int), “Count” and “Percent” should be double (dbl), and everything else should be character (chr).
baby.names.
## read ?read_csv
## limit rows with n_max argument
read_csv("dataSets/babyNames.csv", n_max = 10)
## Parsed with column specification:
## cols(
## Location = col_character(),
## Year = col_integer(),
## Sex = col_logical(),
## Name = col_character(),
## Count = col_double(),
## Percent = col_double(),
## Name.length = col_integer()
## )
## specify column types in the col_types argument
read_csv("dataSets/babyNames.csv", n_max = 10, col_types = "??c????")
## read all the data
baby.names <- read_csv("dataSets/babyNames.csv", col_types = "??c????")
It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame.
## we know that this object will have mode and length, because all R objects do.
mode(baby.names)
## [1] "list"
length(baby.names) # number of columns
## [1] 7
## additional information about this data object
class(baby.names) # check to see that baby.names is a data.frame
## [1] "tbl_df" "tbl" "data.frame"
dim(baby.names) # how many rows and columns?
## [1] 1966001 7
names(baby.names) # or colnames(baby.names)
## [1] "Location" "Year" "Sex" "Name" "Count"
## [6] "Percent" "Name.length"
str(baby.names) # more details
## Classes 'tbl_df', 'tbl' and 'data.frame': 1966001 obs. of 7 variables:
## $ Location : chr "England and Wales" "England and Wales" "England and Wales" "England and Wales" ...
## $ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Name : chr "sophie" "chloe" "jessica" "emily" ...
## $ Count : num 7087 6824 6711 6415 6299 ...
## $ Percent : num 2.39 2.31 2.27 2.17 2.13 ...
## $ Name.length: int 6 5 7 5 6 6 9 7 3 5 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 7
## .. ..$ Location : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Sex : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Count : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Percent : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Name.length: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
glimpse(baby.names) # details, more compactly
## Observations: 1,966,001
## Variables: 7
## $ Location <chr> "England and Wales", "England and Wales", "England...
## $ Year <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", ...
## $ Name <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...
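The col_types value "??c????" used to read the baby names is readr's compact column specification: one character per column, where "?" asks readr to guess the type and "c" forces character. Other letters include "i" for integer and "d" for double. The sketch below applies the same idea to a tiny made-up file; the file contents and the names sex_file and typed are assumptions for illustration only.

```r
library(readr)

# A three-column file whose Sex column contains F and M
sex_file <- tempfile(fileext = ".csv")
writeLines(c("Year,Sex,Count",
             "1996,F,7087",
             "1996,F,6824",
             "1996,M,5000"),
           sex_file)

# "ici" = integer, character, integer (one letter per column);
# n_max = 2 stops after the first two data rows
typed <- read_csv(sex_file, col_types = "ici", n_max = 2)
typed
```

Spelling out the type of a column avoids surprises such as a column of F values being guessed as logical.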
Usually data read into R will be stored as a data.frame.
A data.frame has two dimensions, corresponding to the number of rows and the number of columns (in that order).
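The following sketch shows the dimension functions on a small data.frame built by hand. The data and the name small_df are made up; all functions are from base R.

```r
small_df <- data.frame(name  = c("sophie", "chloe", "emily"),
                       count = c(7087, 6824, 6415))

dim(small_df)      # rows first, then columns: 3 2
nrow(small_df)     # 3
ncol(small_df)     # 2
length(small_df)   # 2: a data.frame is a list of columns, so length counts columns
is.list(small_df)  # TRUE
```

This is why mode reports "list" for a data.frame and length returns the number of columns rather than the number of rows.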
You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:
## make up some example data
(example.df <- data.frame(id = rep(letters[1:4], each = 4),
                          t = rep(1:4, times = 4),
                          var1 = runif(16),
                          var2 = sample(letters[1:3], 16, replace = TRUE)))
## id t var1 var2
## 1 a 1 0.25439062 a
## 2 a 2 0.18972348 c
## 3 a 3 0.70377912 b
## 4 a 4 0.40702740 c
## 5 b 1 0.87109466 c
## 6 b 2 0.48599201 c
## 7 b 3 0.24660803 a
## 8 b 4 0.40431428 a
## 9 c 1 0.65827318 b
## 10 c 2 0.75715090 c
## 11 c 3 0.98883031 a
## 12 c 4 0.52909527 a
## 13 d 1 0.03079849 b
## 14 d 2 0.30094221 a
## 15 d 3 0.42010827 c
## 16 d 4 0.07955623 b
## rows 2 and 4
slice(example.df, c(2, 4))
## id t var1 var2
## 1 a 2 0.1897235 c
## 2 a 4 0.4070274 c
## rows where id == "a"
filter(example.df, id == "a")
## id t var1 var2
## 1 a 1 0.2543906 a
## 2 a 2 0.1897235 c
## 3 a 3 0.7037791 b
## 4 a 4 0.4070274 c
## rows where id is either "a" or "b"
filter(example.df, id %in% c("a", "b"))
## id t var1 var2
## 1 a 1 0.2543906 a
## 2 a 2 0.1897235 c
## 3 a 3 0.7037791 b
## 4 a 4 0.4070274 c
## 5 b 1 0.8710947 c
## 6 b 2 0.4859920 c
## 7 b 3 0.2466080 a
## 8 b 4 0.4043143 a
slice and filter are used to extract rows. select is used to extract columns.
select(example.df, id, var1)
## id var1
## 1 a 0.25439062
## 2 a 0.18972348
## 3 a 0.70377912
## 4 a 0.40702740
## 5 b 0.87109466
## 6 b 0.48599201
## 7 b 0.24660803
## 8 b 0.40431428
## 9 c 0.65827318
## 10 c 0.75715090
## 11 c 0.98883031
## 12 c 0.52909527
## 13 d 0.03079849
## 14 d 0.30094221
## 15 d 0.42010827
## 16 d 0.07955623
select(example.df, id, t, var1)
## id t var1
## 1 a 1 0.25439062
## 2 a 2 0.18972348
## 3 a 3 0.70377912
## 4 a 4 0.40702740
## 5 b 1 0.87109466
## 6 b 2 0.48599201
## 7 b 3 0.24660803
## 8 b 4 0.40431428
## 9 c 1 0.65827318
## 10 c 2 0.75715090
## 11 c 3 0.98883031
## 12 c 4 0.52909527
## 13 d 1 0.03079849
## 14 d 2 0.30094221
## 15 d 3 0.42010827
## 16 d 4 0.07955623
You can also conveniently select a single column using $, like this:
example.df$t
## [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
Data manipulation commands can be combined:
filter(select(example.df,
              id,
              var1),
       id == "a")
## id var1
## 1 a 0.2543906
## 2 a 0.1897235
## 3 a 0.7037791
## 4 a 0.4070274
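Nested calls like the one above are read from the inside out. A common alternative, not otherwise used in these notes, is the %>% pipe operator that dplyr makes available, which passes the result on its left as the first argument of the function on its right. The sketch below uses a small made-up data frame named toy and assumes dplyr is installed.

```r
library(dplyr)

toy <- data.frame(id   = c("a", "a", "b"),
                  t    = 1:3,
                  var1 = c(0.5, 0.25, 0.75))

# Nested form: select happens first, then filter
nested <- filter(select(toy, id, var1), id == "a")

# Piped form: the same two steps, written in the order they happen
piped <- toy %>%
  select(id, var1) %>%
  filter(id == "a")

all.equal(nested, piped)  # both approaches give the same rows and columns
```

Which style to use is a matter of taste; the nested form is used throughout the rest of this workshop.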
In the previous example we used == to filter rows where id was “a”. Other relational and logical operators are listed below.
| Operator | Meaning |
|---|---|
| == | equal to |
| != | not equal to |
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| %in% | contained in |
| & | and |
| | | or |
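The operators in the table can be combined inside filter. The sketch below uses a small made-up data frame named scores and assumes the dplyr package is installed; the result names are chosen for illustration.

```r
library(dplyr)

scores <- data.frame(id    = c("a", "a", "b", "b", "c"),
                     t     = c(1, 2, 1, 2, 1),
                     score = c(0.2, 0.9, 0.5, 0.7, 0.4))

# "and": id is not "a" AND score is at least 0.5 (both rows of "b")
not_a_high <- filter(scores, id != "a" & score >= 0.5)

# "or": id is "c" OR score is above 0.8 (second row of "a", and "c")
c_or_top <- filter(scores, id == "c" | score > 0.8)

# conditions separated by commas are combined with "and"
b_or_c_first <- filter(scores, id %in% c("b", "c"), t < 2)

not_a_high
c_or_top
b_or_c_first
```

Parentheses can be used to group conditions when mixing & and | in a single expression.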
You can modify data.frames using the mutate() function. It works like this:
example.df
## id t var1 var2
## 1 a 1 0.25439062 a
## 2 a 2 0.18972348 c
## 3 a 3 0.70377912 b
## 4 a 4 0.40702740 c
## 5 b 1 0.87109466 c
## 6 b 2 0.48599201 c
## 7 b 3 0.24660803 a
## 8 b 4 0.40431428 a
## 9 c 1 0.65827318 b
## 10 c 2 0.75715090 c
## 11 c 3 0.98883031 a
## 12 c 4 0.52909527 a
## 13 d 1 0.03079849 b
## 14 d 2 0.30094221 a
## 15 d 3 0.42010827 c
## 16 d 4 0.07955623 b
## modify example.df and assign the modified data.frame the name example.df
example.df <- mutate(example.df,
                     var2 = var1/t, # replace the values in var2
                     var3 = 1:length(t), # create a new column named var3
                     var4 = factor(letters[t]),
                     t = NULL # delete the column named t
                     )
## examine our changes
example.df
## id var1 var2 var3 var4
## 1 a 0.25439062 0.25439062 1 a
## 2 a 0.18972348 0.09486174 2 b
## 3 a 0.70377912 0.23459304 3 c
## 4 a 0.40702740 0.10175685 4 d
## 5 b 0.87109466 0.87109466 5 a
## 6 b 0.48599201 0.24299600 6 b
## 7 b 0.24660803 0.08220268 7 c
## 8 b 0.40431428 0.10107857 8 d
## 9 c 0.65827318 0.65827318 9 a
## 10 c 0.75715090 0.37857545 10 b
## 11 c 0.98883031 0.32961010 11 c
## 12 c 0.52909527 0.13227382 12 d
## 13 d 0.03079849 0.03079849 13 a
## 14 d 0.30094221 0.15047110 14 b
## 15 d 0.42010827 0.14003609 15 c
## 16 d 0.07955623 0.01988906 16 d
Now that we have made some changes to our data set, we might want to save those changes to a file.
# write data to a .csv file
write_csv(example.df, "example.csv")
# write data to an R file
write_rds(example.df, "example.rds")
# write data to a Stata file
library(haven)
write_dta(example.df, path = "example.dta")
In addition to importing individual datasets, R can save and load entire workspaces.
ls() # list objects in our workspace
## [1] "a" "baby.names"
## [3] "births.by.year" "comet"
## [5] "comet.plot" "example.df"
## [7] "fit" "name.length.by.location"
## [9] "nwbuilding" "orig.search.path"
## [11] "popular.girl.names" "w"
## [13] "W" "worldpop"
## [15] "x" "X"
## [17] "y" "Y"
## [19] "z" "Z"
save.image(file="myWorkspace.RData") # save workspace
rm(list=ls()) # remove all objects from our workspace
ls() # list stored objects to make sure they are deleted
## character(0)
Load the “myWorkspace.RData” file and check that it is restored.
load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
## [1] "a" "baby.names"
## [3] "births.by.year" "comet"
## [5] "comet.plot" "example.df"
## [7] "fit" "name.length.by.location"
## [9] "nwbuilding" "orig.search.path"
## [11] "popular.girl.names" "w"
## [13] "W" "worldpop"
## [15] "x" "X"
## [17] "y" "Y"
## [19] "z" "Z"
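It is also possible to save only selected objects rather than the whole workspace. The sketch below uses base R's save and load with a temporary file, and loads the objects into a separate environment so that nothing in the current workspace is overwritten. The object names and the restored environment are made up for this illustration.

```r
first_obj  <- 1:3
second_obj <- "hello"

# Save just these two objects to a temporary .RData file
rdata_file <- tempfile(fileext = ".RData")
save(first_obj, second_obj, file = rdata_file)

# Load them into a fresh environment instead of the workspace
restored <- new.env()
load(rdata_file, envir = restored)

sort(ls(restored))    # "first_obj" "second_obj"
restored$first_obj    # 1 2 3
```

Loading into a separate environment is a cautious choice: load replaces existing objects that have the same names without asking.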
Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names.
Filter baby.names to show only names given to at least 5 percent of boys.
Filter baby.names to include only names given to at least 3 percent of girls. Save this to a Stata data set named “popularGirlNames.dta”.
filter(baby.names, Sex == "M" & Percent >= 5)
## # A tibble: 0 × 7
## # ... with 7 variables: Location <chr>, Year <int>, Sex <chr>, Name <chr>,
## # Count <dbl>, Percent <dbl>, Name.length <int>
baby.names <- mutate(baby.names, Proportion = Percent/100)
popular.girl.names <- filter(baby.names, Sex == "F" & Percent >= 3)
## write_dta (from the haven package attached earlier) writes Stata files
write_dta(popular.girl.names, path = "popularGirlNames.dta")
Descriptive statistics of single variables are straightforward:
sum(example.df$var1) # calculate sum of var 1
## [1] 7.327684
mean(example.df$var1)
## [1] 0.4579803
median(example.df$var1)
## [1] 0.4135678
sd(example.df$var1) # calculate standard deviation of var1
## [1] 0.2785311
var(example.df$var1)
## [1] 0.07757959
## summaries of individual columns
summary(example.df$var1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0308 0.2524 0.4136 0.4580 0.6696 0.9888
summary(example.df$var2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01989 0.09952 0.14530 0.23890 0.27320 0.87110
## summary of whole data.frame
summary(example.df)
## id var1 var2 var3 var4
## a:4 Min. :0.0308 Min. :0.01989 Min. : 1.00 a:4
## b:4 1st Qu.:0.2524 1st Qu.:0.09952 1st Qu.: 4.75 b:4
## c:4 Median :0.4136 Median :0.14525 Median : 8.50 c:4
## d:4 Mean :0.4580 Mean :0.23893 Mean : 8.50 d:4
## 3rd Qu.:0.6696 3rd Qu.:0.27320 3rd Qu.:12.25
## Max. :0.9888 Max. :0.87109 Max. :16.00
Some of these functions (e.g., summary) will also work with data.frames and other types of objects; others (such as sd) will not.
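When a function such as sd does not accept a whole data.frame, one option is to apply it to each column in turn with base R's sapply. The sketch below uses a small made-up data frame of numeric columns; the names measurements and column_sds are only for illustration.

```r
measurements <- data.frame(u = c(1, 2, 3, 4),
                           v = c(2, 4, 6, 8))

# sapply calls sd once per column and collects the results in a named vector
column_sds <- sapply(measurements, sd)
column_sds
```

This only makes sense when every column is numeric; columns holding text or factors would need to be dropped first, for example with select.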
The summarize function can be used to calculate statistics by a grouping variable. Here is how it works.
summarize(group_by(example.df, id), mean(var1), sd(var1))
## # A tibble: 4 × 3
## id `mean(var1)` `sd(var1)`
## <fctr> <dbl> <dbl>
## 1 a 0.3887302 0.2289406
## 2 b 0.5020022 0.2653643
## 3 c 0.7333374 0.1942449
## 4 d 0.2078513 0.1839622
You can group by multiple variables:
summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
## Source: local data frame [16 x 4]
## Groups: id [?]
##
## id var3 `mean(var1)` `sd(var1)`
## <fctr> <int> <dbl> <dbl>
## 1 a 1 0.25439062 NA
## 2 a 2 0.18972348 NA
## 3 a 3 0.70377912 NA
## 4 a 4 0.40702740 NA
## 5 b 5 0.87109466 NA
## 6 b 6 0.48599201 NA
## 7 b 7 0.24660803 NA
## 8 b 8 0.40431428 NA
## 9 c 9 0.65827318 NA
## 10 c 10 0.75715090 NA
## 11 c 11 0.98883031 NA
## 12 c 12 0.52909527 NA
## 13 d 13 0.03079849 NA
## 14 d 14 0.30094221 NA
## 15 d 15 0.42010827 NA
## 16 d 16 0.07955623 NA
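The NA values in the sd column above arise because each combination of id and var3 contains a single row, and the standard deviation of one value is undefined. The sketch below shows this on a small made-up data frame named obs, and also shows that summary columns can be given names of your choosing. It assumes dplyr is installed; n() is the dplyr function that counts the rows in each group.

```r
library(dplyr)

obs <- data.frame(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 3, 2, 4, 6))

by_group <- summarize(group_by(obs, group),
                      n_obs      = n(),          # number of rows in each group
                      mean_value = mean(value),
                      sd_value   = sd(value))
by_group

single_value_sd <- sd(5)   # one observation: the result is NA
single_value_sd
```

Including a count such as n_obs next to each summary makes it easy to spot groups that are too small for the statistic to be meaningful.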
Earlier we learned how to write a data set to a file. But what if we want to write something that isn’t in a nice rectangular format, like the output of summary? For that we can use the sink() function:
sink(file="output.txt", split=TRUE) # start logging
print("This is the summary of example.df \n")
## [1] "This is the summary of example.df \n"
print(summary(example.df))
## id var1 var2 var3 var4
## a:4 Min. :0.0308 Min. :0.01989 Min. : 1.00 a:4
## b:4 1st Qu.:0.2524 1st Qu.:0.09952 1st Qu.: 4.75 b:4
## c:4 Median :0.4136 Median :0.14525 Median : 8.50 c:4
## d:4 Mean :0.4580 Mean :0.23893 Mean : 8.50 d:4
## 3rd Qu.:0.6696 3rd Qu.:0.27320 3rd Qu.:12.25
## Max. :0.9888 Max. :0.87109 Max. :16.00
sink() ## sink with no arguments turns logging off
births.by.year.
name.length.by.location.
sum(baby.names$Count)
## [1] 76865321
sum(filter(baby.names, Location == "MA")$Count)
## [1] 1232841
births.by.year <- summarize(group_by(baby.names, Year), sum(Count))
mean(baby.names$Name.length)
## [1] 5.978752
name.length.by.location <- summarize(group_by(baby.names, Location), mean(Name.length))
Thanks to classes and methods, you can plot() many kinds of objects:
plot(example.df$var4)
plot(select(example.df, id, var1))
plot(select(example.df, id, var4))
plot(select(example.df, var1, var2))